In recent decades, Philadelphia has experienced substantial changes in population dynamics, economic fluctuations, and redevelopment pressures. Among these changes, housing values have become a key indicator of the socio-economic conditions within the city’s neighborhoods. Rising housing values typically reflect a higher presence of wealthier residents or a neighborhood undergoing gentrification, while declining housing values often signal increased economic pressure or decline. Therefore, accurately forecasting the median housing values across Philadelphia is crucial for urban planners and policymakers, helping them to proactively address potential risks such as displacement and disinvestment. Supporting Philadelphia’s long-term vision of fostering vibrant, diverse, and resilient communities is also essential.
To better understand the potential factors influencing housing values in Philadelphia, we identify several key variables: educational attainment, vacancy rates, the proportion of single-family homes, and poverty levels. These factors are closely related to housing market trends and provide insights into the neighborhood’s socio-economic status.
Educational attainment is closely linked to the socio-economic characteristics of neighborhoods. Individuals with higher educational attainment typically earn higher incomes and contribute more to local economies. A higher concentration of individuals with advanced educational backgrounds tends to increase the demand for well-designed housing in affluent neighborhoods. With a consistent supply, housing prices and values are likely to rise. Similarly, a high poverty rate also indicates a declining community. Since many residents cannot afford luxury housing, the housing values in those neighborhoods are likely to be Lower.
High vacancy rates often correlate with declining neighborhoods and reduced median house values. Research shows that vacant properties affect multiple facets of community life, such as housing and neighborhood vitality, crime prevention efforts, and the well-being of commercial districts. As a result, areas with numerous vacant housing units typically see diminished median housing values.
The proportion of single-family homes in an area influences housing values. These homes are generally private and comfortable, making them more desirable in many U.S. housing markets. However, they are relatively common in suburban neighborhoods and may suffer from low accessibility and limited infrastructure, which can negatively affect their property values as well.
In this study, we utilize ordinary least squares (OLS) regression to analyze the relationship between these socioeconomic factors and median house values in Philadelphia. By examining these relationships, we aim to identify critical predictors of median housing values throughout Philadelphia and offer insights for decision-makers and community initiatives.
To predict median house values in Philadelphia, we obtained the original dataset from the United States Census data. The dataset represents census block groups from the year 2000 and initially contained 1,816 observations. The key variables included:
To refine the dataset for modeling purposes, we applied the following filtering criteria:
Additionally, we removed a specific block group in North Philadelphia that exhibited inconsistencies, with an unusually high median house value (over $800,000) despite a very low median household income (less than $8,000).
After data cleaning, the final dataset contained 1,720 observations.
We will first examine the summary statistics of key variables in the dataset, including the dependent variables MEDHVAL (Median House Value), and predictors NBELPOV100 (Households Living in Poverty), PCTBACHMOR (% of Individuals with Bachelor’s Degrees or Higher), PCTVACANT(% of Vacant Houses), PCTSINGLES( % of Single House Units).
We will examine the mean and standard deviation (SD) of key variables in the dataset.
The mean (\(\bar{X}\)) represents the average value of a variable and is calculated as:
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
where:
- \(X_i\) represents each individual
observation
- \(n\) is the total number of
observations
The mean gives us a single representative value of the dataset, which helps in understanding the typical value for a given variable.
To measure variability, we use the standard deviation (SD), which quantifies how much the values in a dataset deviate from the mean. The formula for the sample standard deviation (\(s\)) is:
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \]
where:
- \(X_i\) represents each individual
observation
- \(\bar{X}\) is the mean of the
observations
- \(n\) is the total number of
observations
A larger standard deviation indicates that the data points are more spread out, while a smaller standard deviation suggests that the data points are closer to the mean.
We will also examine the histograms and apply log transformations for key variables to assess whether the transformed variables follow a more normal-like distribution.
Histograms provide a visual representation of how a variable’s values are distributed, helping to identify whether the data follows a normal distribution, is right-skewed, or left-skewed. Linear regression model assume that variables are approximately normally distributed.
If the histogram is right-skewed, it suggests that a
small number of observations have significantly higher values compared
to the rest. For variables following a right-skewed
distribution, we will apply a log
transformation to these variables to improve normality. Since
the log transformation is undefined for zero or
negative values, we must first check whether any variable contains
zero.
- If no zeros are present, we apply the standard log
transformation:
\[
X' = \log_{10} (X)
\]
- If zeros are present, we adjust by adding 1 before
taking the log:
\[
X' = \log_{10} (X + 1)
\]
This ensures that all values remain positive and avoids undefined
values.
By comparing the original histograms with the log-transformed histograms, we will assess whether the transformation improves the suitability of the data for predictive modeling.
We will analyze correlations between predictors, to detect potential multicollinearity before proceeding with regression analysis, which can distort model interpretations.
Multicollinearity occurs when predictors are highly correlated, which can lead to unstable regression coefficients. It also inflates standard errors, reducing the statistical significance of predictors, and increases the risk of overfitting, as redundant variables do not contribute new information to the model.
The correlation coefficient \(r\) is calculated as:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \] In a more concise way, this above formula is also equivalent to the following:
\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x}
\right) \left( \frac{y_i - \bar{y}}{S_y} \right)
\] where:
- \(X_i\) and \(Y_i\) are individual data points for
variables \(X\) and \(Y\), respectively.
- \(\bar{X}\) and \(\bar{Y}\) are the mean
values of \(X\) and \(Y\).
- The numerator represents the covariance between \(X\) and \(Y\), while the denominator
normalizes the values.
The correlation coefficient \(r\) ranges from -1 to 1: A value of +1 indicates a perfect positive linear relationship, while a value of -1 indicates a perfect negative linear relationship. A value of 0 suggests no linear relationship between the variables, meaning changes in one do not influence the other.
dependent_var <- "MEDHVAL"
predictors <- c("PCTBACHMOR", "NBELPOV100", "PCTVACANT", "PCTSINGLES")
summary_stats <- data %>%
dplyr::select(all_of(c(dependent_var, predictors))) %>%
summarise_all(list(Mean = mean, SD = sd), na.rm = TRUE) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
separate(Variable, into = c("Variable", "Stat"), sep = "_") %>%
pivot_wider(names_from = Stat, values_from = Value)
summary_stats$Variable <- recode(summary_stats$Variable,
"MEDHVAL" = "Median House Value",
"NBELPOV100" = "# Households Living in Poverty",
"PCTBACHMOR" = "% of Individuals with Bachelor’s Degrees or Higher",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
)
summary_stats <- summary_stats %>%
mutate(
Mean = round(Mean, 2),
SD = round(SD, 2)
)
summary_stats <- summary_stats %>%
arrange(Variable == "Median House Value")
predictor_rows <- which(summary_stats$Variable != "Median House Value")
dependent_rows <- which(summary_stats$Variable == "Median House Value")
# Determine the start and end rows for each group
start_pred <- min(predictor_rows)
end_pred <- max(predictor_rows)
start_dep <- min(dependent_rows)
end_dep <- max(dependent_rows)
# Create the table using kable and add extra formatting
kable(summary_stats, caption = "Summary Statistics",
align = c("l", "l", "l"), booktabs = TRUE, escape = FALSE ) %>%
add_header_above(c(" " = 1, "Statistics" = 2)) %>%
kable_styling(full_width = FALSE) %>%
group_rows("Predictors", start_pred, end_pred) %>%
group_rows("Dependent Variable", start_dep, end_dep)%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)| Variable | Mean | SD |
|---|---|---|
| Predictors | ||
| % of Individuals with Bachelor’s Degrees or Higher | 16.08 | 17.77 |
| # Households Living in Poverty | 189.77 | 164.32 |
| % of Vacant Houses | 11.29 | 9.63 |
| % of Single House Units | 9.23 | 13.25 |
| Dependent Variable | ||
| Median House Value | 66287.73 | 60006.08 |
#check 0
columns_to_check <- c(dependent_var, predictors)
zero_counts <- sapply(data[columns_to_check], function(x) sum(x == 0, na.rm = TRUE))
zero_counts[zero_counts > 0]## PCTBACHMOR NBELPOV100 PCTVACANT PCTSINGLES
## 143 33 163 306
data <- data %>%
mutate(
LNMEDHVAL = log(MEDHVAL),
LNPCTBACHMOR = log(1+PCTBACHMOR),
LNNBELPOV100 = log(1+NBELPOV100),
LNPCTVACANT = log(1+PCTVACANT),
LNPCTSINGLES = log(1+PCTSINGLES)
)longer_version<- data %>%
pivot_longer(cols = c("MEDHVAL", "PCTBACHMOR", "NBELPOV100", "PCTVACANT", "PCTSINGLES"),
names_to = "Variable",
values_to = "Value")
ggplot(longer_version,aes(x = Value)) +
geom_histogram(aes(y = ..count..), fill = "black", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3, labeller = as_labeller(c(
"MEDHVAL" = "Median House Value",
"PCTBACHMOR" = "% with Bachelor’s Degrees or Higher",
"NBELPOV100" = "# Households Living in Poverty",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
))) +
labs(x = "Value", y = "Count", title = "Histograms of Dependent and Predictor Variables") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))# histograms of the transformed variables
longer_version2 <- data %>%
pivot_longer(cols = c(LNMEDHVAL, LNPCTBACHMOR ,LNNBELPOV100,LNPCTVACANT, LNPCTSINGLES),
names_to = "Variable",
values_to = "Value")
ggplot(longer_version2,aes(x = Value)) +
geom_histogram(aes(y = ..count..), fill = "red", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3, labeller = as_labeller(c(
"LNMEDHVAL" = "Log Median House Value",
"LNPCTBACHMOR" = "Log % with Bachelor’s Degree",
"LNNBELPOV100" = "Log # Households in Poverty",
"LNPCTVACANT" = "Log % Vacant Houses",
"LNPCTSINGLES" = "Log % Single House Units"
))) +
labs(x = "Value", y = "Count", title = "Histograms of Dependent and log transformed Predictor Variables") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))ggplot(shape) +
geom_sf(aes(fill = LNMEDHVAL), color = "transparent") +
scale_fill_gradientn(colors = c("#fff0f3", "#a4133c"),
name = "LNMEDHVAL",
na.value = "transparent") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8)) +
labs(title = "Log Transformed Median House Value")shpe_longer<- shape %>%
pivot_longer(cols = c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV"),
names_to = "Variable",
values_to = "Value")
custom_titles <- c(
PCTVACANT = "Percent of Vacant Houses",
PCTSINGLES = "Percent of Single House Units",
PCTBACHMOR = "Percent of Bachelor's Degree or Higher",
LNNBELPOV = "Logged Transformed Poverty Rate"
)
plot_list <- lapply(unique(shpe_longer$Variable), function(var_name) {
data_subset <- subset(shpe_longer, Variable == var_name)
ggplot(data_subset) +
geom_sf(aes(fill = Value), color = "transparent") +
scale_fill_gradientn(
colors = c("#fff0f3", "#a4133c"),
name = var_name,
na.value = "transparent"
) +
labs(title = custom_titles[[var_name]]) +
theme(
legend.text = element_text(size = 8),
legend.title = element_text(size = 10),
legend.key.size = unit(0.3, "cm"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 15, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8)
)
})
# Combine the plots into a grid (2 columns by 2 rows)
combined_plot <- (plot_list[[1]] + plot_list[[2]]) /
(plot_list[[3]] + plot_list[[4]])
combined_plot##
## Call:
## lm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV100, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25825 -0.20391 0.03822 0.21744 2.24347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.1137661 0.0465330 238.836 < 2e-16 ***
## PCTVACANT -0.0191569 0.0009779 -19.590 < 2e-16 ***
## PCTSINGLES 0.0029769 0.0007032 4.234 2.42e-05 ***
## PCTBACHMOR 0.0209098 0.0005432 38.494 < 2e-16 ***
## LNNBELPOV100 -0.0789054 0.0084569 -9.330 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3665 on 1715 degrees of freedom
## Multiple R-squared: 0.6623, Adjusted R-squared: 0.6615
## F-statistic: 840.9 on 4 and 1715 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: LNMEDHVAL
## Df Sum Sq Mean Sq F value Pr(>F)
## PCTVACANT 1 180.392 180.392 1343.087 < 2.2e-16 ***
## PCTSINGLES 1 24.543 24.543 182.734 < 2.2e-16 ***
## PCTBACHMOR 1 235.118 235.118 1750.551 < 2.2e-16 ***
## LNNBELPOV100 1 11.692 11.692 87.054 < 2.2e-16 ***
## Residuals 1715 230.344 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fitted_values <- fitted(fit)
residuals_values <- residuals(fit)
standardized_residuals <- rstandard(fit)
data <- data %>%
mutate(
Fitted = fitted_values,
Residuals = residuals_values,
Standardized_Residuals = standardized_residuals)ggplot(data, aes(x = Fitted, y = Standardized_Residuals)) +
geom_point(color = "black", size= 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
title = "Scatter Plot of Standardized Residuals vs Fitted Values",
x = "Predicted Values",
y = "Standardized Residuals"
) +
theme_minimal() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))ggplot(data, aes(x = Standardized_Residuals)) +
geom_histogram(bins = 30, fill = "black") +
labs(title = "Histogram of Standardized Residuals",
x = "Standardized Residuals",
y = "Frequency") +
theme_minimal() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))longer<-data %>%
pivot_longer(cols = c("PCTBACHMOR", "LNNBELPOV100", "PCTVACANT", "PCTSINGLES"),
names_to = "Variable",
values_to = "Value")
ggplot(longer,aes(x = Value, y = LNMEDHVAL)) +
geom_point(color = "black", size= 0.4) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c(
"PCTBACHMOR" = "% with Bachelor’s Degrees or Higher",
"LNNBELPOV100" = "Logged Households Living in Poverty",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
))) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8)) +
labs(title = "Scatter Plots of Dependent Variable vs. Predictors",
x = "Predictor Value",
y = "Log of Median House Value")join<- data %>%
dplyr::select(POLY_ID, Standardized_Residuals)
shape <- shape %>%
left_join(join, by = c("POLY_ID" = "POLY_ID"))
ggplot(shape)+
geom_sf(aes(fill = Standardized_Residuals), color = "transparent") +
scale_fill_gradientn(colors = c("#fff0f3", "#a4133c"),
name = "Std Residuals",
na.value = "transparent") + # Choose a color palette, invert direction if needed
labs(title = "Choropleth Map of Standardized Residuals") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))custom_labels <- c(
"% of Individuals with Bachelor’s Degrees or Higher" = "PCTBACHMOR",
"% of Vacant Houses" = "PCTVACANT",
"% of Single House Units" = "PCTSINGLES",
"# Households Living in Poverty" = "LNNBELPOV100"
)
predictor_vars <- data[, c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV100")]
cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")
print(cor_matrix)## PCTVACANT PCTSINGLES PCTBACHMOR LNNBELPOV100
## PCTVACANT 1.0000000 -0.1513734 -0.2983580 0.2495470
## PCTSINGLES -0.1513734 1.0000000 0.1975461 -0.2905159
## PCTBACHMOR -0.2983580 0.1975461 1.0000000 -0.3197668
## LNNBELPOV100 0.2495470 -0.2905159 -0.3197668 1.0000000
rownames(cor_matrix) <- names(custom_labels)
colnames(cor_matrix) <- names(custom_labels)
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#d73027", "white", "#1a9850"))+
labs(title = "Correlation Matrix for all Predictor Variables") +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))## Start: AIC=-3448.07
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
## Df Sum of Sq RSS AIC
## <none> 230.34 -3448.1
## - PCTSINGLES 1 2.407 232.75 -3432.2
## - LNNBELPOV100 1 11.692 242.04 -3364.9
## - PCTVACANT 1 51.546 281.89 -3102.7
## - PCTBACHMOR 1 199.020 429.36 -2379.0
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
## Final Model:
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1715 230.3435 -3448.073
lm <- trainControl(method = "cv", number = 5)
cvlm_model <- train(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100, data=data, method = "lm", trControl = lm)
print(cvlm_model)## Linear Regression
##
## 1720 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1376, 1376, 1376, 1376, 1376
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3666659 0.6618442 0.2724531
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
cvlm_model_reduced = train(LNMEDHVAL ~ PCTVACANT + MEDHHINC, data = data, method = "lm", trControl = lm)
print(cvlm_model_reduced)## Linear Regression
##
## 1720 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1376, 1376, 1376, 1376, 1376
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4426041 0.5112366 0.3186029
##
## Tuning parameter 'intercept' was held constant at a value of TRUE